AITopics | language model benchmark

Collaborating Authors

language model benchmark

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

Banerjee, Sourav, Agarwal, Ayushi, Singh, Eishkaran

arXiv.org Machine LearningDec-2-2024

The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum, from basic metrics to complex benchmarks like GLUE and MMLU. These vulnerabilities manifest through benchmark exploitation, dataset contamination, and evaluation bias, creating a false perception of progress in language understanding capabilities. Through extensive review of contemporary evaluation approaches, we identify significant limitations in static benchmark designs, human evaluation protocols, and LLM-as-judge frameworks, all of which compromise the reliability of current performance assessments. As LLM capabilities evolve and existing benchmarks become redundant, we lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks. This requires frameworks that are adapted dynamically, addressing current limitations and providing a more accurate reflection of LLM performance.

benchmark, large language model, natural language, (17 more...)

arXiv.org Machine Learning

2412.03597

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology (0.46)
Education > Assessment & Standards > Student Performance (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain

Panagoulias, Dimitrios P., Papatheodosiou, Persephone, Palamidas, Anastasios P., Sanoudos, Mattheos, Tsoureli-Nikita, Evridiki, Virvou, Maria, Tsihrintzis, George A.

arXiv.org Artificial IntelligenceMay-17-2024

Large Language Models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence (AI) technology which is rapidly evolving and promises to aid in medical diagnosis either by assisting doctors or by simulating a doctor's workflow in more advanced and complex implementations. In this technical paper, we outline Cognitive Network Evaluation Toolkit for Medical Domains (COGNET-MD), which constitutes a novel benchmark for LLM evaluation in the medical domain. Specifically, we propose a scoring-framework with increased difficulty to assess the ability of LLMs in interpreting medical text. The proposed framework is accompanied with a database of Multiple Choice Quizzes (MCQs). To ensure alignment with current medical trends and enhance safety, usefulness, and applicability, these MCQs have been constructed in collaboration with several associated medical experts in various medical domains and are characterized by varying degrees of difficulty. The current (first) version of the database includes the medical domains of Psychiatry, Dentistry, Pulmonology, Dermatology and Endocrinology, but it will be continuously extended and expanded to include additional medical domains.

cognet-md, dataset, medical domain, (11 more...)

arXiv.org Artificial Intelligence

2405.10893

Country:

Europe > Greece > Attica > Athens (0.05)
North America > United States > Virginia (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.95)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Supervised Learning and Large Language Model Benchmarks on Mental Health Datasets: Cognitive Distortions and Suicidal Risks in Chinese Social Media

Qi, Hongzhi, Zhao, Qing, Song, Changwei, Zhai, Wei, Luo, Dan, Liu, Shuo, Yu, Yi Jing, Wang, Fan, Zou, Huijing, Yang, Bing Xiang, Li, Jianqiang, Fu, Guanghui

arXiv.org Artificial IntelligenceNov-1-2023

In the realm of social media, users frequently convey personal sentiments, with some potentially indicating cognitive distortions or suicidal tendencies. Timely recognition of such signs is pivotal for effective interventions. In response, we introduce two novel annotated datasets from Chinese social media, focused on cognitive distortions and suicidal risk classification. We propose a comprehensive benchmark using both supervised learning and large language models, especially from the GPT series, to evaluate performance on these datasets. To assess the capabilities of the large language models, we employed three strategies: zero-shot, few-shot, and fine-tuning. Furthermore, we deeply explored and analyzed the performance of these large language models from a psychological perspective, shedding light on their strengths and limitations in identifying and understanding complex human emotions. Our evaluations underscore a performance difference between the two approaches, with the models often challenged by subtle category distinctions. While GPT-4 consistently delivered strong results, GPT-3.5 showed marked improvement in suicide risk classification after fine-tuning. This research is groundbreaking in its evaluation of large language models for Chinese social media tasks, accentuating the models' potential in psychological contexts. All datasets and code are made available.

cognitive distortion and suicidal risk, language model benchmark, mental health dataset, (2 more...)

arXiv.org Artificial Intelligence

2309.03564

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.53)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.44)

Add feedback

LAiW: A Chinese Legal Large Language Models Benchmark (A Technical Report)

Dai, Yongfu, Feng, Duanyu, Huang, Jimin, Jia, Haochen, Xie, Qianqian, Zhang, Yifang, Han, Weiguang, Tian, Wei, Wang, Hao

arXiv.org Artificial IntelligenceOct-9-2023

With the emergence of numerous legal LLMs, there is currently a lack of a comprehensive benchmark for evaluating their legal abilities. In this paper, we propose the first Chinese Legal LLMs benchmark based on legal capabilities. Through the collaborative efforts of legal and artificial intelligence experts, we divide the legal capabilities of LLMs into three levels: basic legal NLP capability, basic legal application capability, and complex legal application capability. We have completed the first phase of evaluation, which mainly focuses on the capability of basic legal NLP. The evaluation results show that although some legal LLMs have better performance than their backbones, there is still a gap compared to ChatGPT. Our benchmark can be found at URL.

application, llm, zhang, (16 more...)

arXiv.org Artificial Intelligence

2310.0562

Country:

Asia > China > Sichuan Province > Chengdu (0.05)
Asia > China > Hubei Province > Wuhan (0.05)
Asia > China > Shanghai > Shanghai (0.04)
(2 more...)

Genre:

Research Report (0.70)
Overview (0.46)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback